
Conversation

@andrewor14 (Contributor)

What changes were proposed in this pull request?

This patch implements the CREATE TABLE command using the SessionCatalog. Previously we handled only CTAS and CREATE TABLE ... USING. This requires us to refactor CatalogTable to accept various fields (e.g. bucket and skew columns) and pass them to Hive.
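Roughly, the refactored metadata could be sketched as follows. This is a self-contained illustration, not Spark's actual `CatalogTable`; the simplified field types are assumptions, but the name lists and the subset invariant are the ones described above.

```scala
// Sketch only: a simplified stand-in for the refactored CatalogTable.
// The invariant introduced by this patch: partition/sort/bucket/skew
// column names must all be a subset of the schema's column names.
case class CatalogColumn(name: String, dataType: String)

case class CatalogTable(
    name: String,
    schema: Seq[CatalogColumn],
    partitionColumnNames: Seq[String] = Nil,
    sortColumnNames: Seq[String] = Nil,
    bucketColumnNames: Seq[String] = Nil,
    skewColumnNames: Seq[String] = Nil) {
  private val known = schema.map(_.name).toSet
  require(
    (partitionColumnNames ++ sortColumnNames ++ bucketColumnNames ++ skewColumnNames)
      .forall(known.contains),
    "all referenced column names must appear in the schema")
}
```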

WIP: Note that I haven't verified whether this actually works yet! But I believe it does.

How was this patch tested?

Tests will come in a future commit.

Andrew Or added 4 commits April 8, 2016 14:41
We need to reconcile the differences between what's added here in
SparkSqlParser and HiveSqlParser. That will come in the next
commit.

This currently still fails tests, obviously because create table
is not implemented yet!

Before: CatalogTable has schema, partitionColumns and sortColumns.
There are no constraints between the 3. However, Hive will
complain if schema and partitionColumns overlap.

After: CatalogTable has schema, partitionColumnNames,
sortColumnNames, bucketColumnNames and skewColumnNames. All the
columns must be a subset of schema. This means splitting up
schema into (schema, partitionCols) before passing it to Hive.

This allows us to store the columns more uniformly. Otherwise
partition columns would be the odd one out. This commit also
fixes "alter table bucketing", which was incorrectly using
partition columns as bucket columns.
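The splitting step described in this commit message might look like the following hypothetical helper (not the PR's actual code):

```scala
// Sketch: Hive takes data columns and partition columns separately, so a
// unified schema has to be split before the table is handed to Hive.
case class Column(name: String, dataType: String)

def splitSchema(
    schema: Seq[Column],
    partitionColumnNames: Seq[String]): (Seq[Column], Seq[Column]) = {
  val partNames = partitionColumnNames.toSet
  // Keep the two halves disjoint; Hive complains if they overlap.
  schema.partition(col => !partNames.contains(col.name))
}
```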

This involves reverting part of the changes in an earlier commit,
where we tried to implement the parsing logic in the general SQL
parser and introduced a bunch of case classes that we won't end
up using.

As of this commit the actual CREATE TABLE logic is not there yet.
It will come in a future commit.
* ALTER TABLE table1 RENAME TO table2;
* ALTER VIEW view1 RENAME TO view2;
* }}}
*/
@andrewor14 (Contributor Author)

moved to tables.scala. I just moved this one command for now to avoid inflating the diff too much.


fileFormat
: INPUTFORMAT inFmt=STRING OUTPUTFORMAT outFmt=STRING (SERDE serdeCls=STRING)?
(INPUTDRIVER inDriver=STRING OUTPUTDRIVER outDriver=STRING)? #tableFileFormat
@andrewor14 (Contributor Author)

@hvanhovell I deleted the INPUTDRIVER and OUTPUTDRIVER here because Hive doesn't support them. Why was this added in the first place? Is there any supporting documentation for this somewhere?

Contributor

@andrewor14 I wanted to make sure we supported the same grammar as Hive and I used their grammars as a basis. So this is defined in the following two locations:

The main idea was that I could throw better errors. But if it is not supported by Hive itself then please remove it!

Contributor

Let's remove it. I have never seen this before and it is not documented anywhere.

@SparkQA

SparkQA commented Apr 9, 2016

Test build #55425 has finished for PR 12271 at commit 5e0fe03.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2016

Test build #55426 has finished for PR 12271 at commit f7501d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

@andrewor14 Just want to let you know, @xwu0226 is doing the command SHOW CREATE TABLE. He is writing many test cases in the PR: #12132. It might help you in this PR. Thanks!

* [COMMENT table_comment]
* [PARTITIONED BY (col3 data_type [COMMENT col_comment], ...)]
* [CLUSTERED BY (col1, ...) [SORTED BY (col1 [ASC|DESC], ...)] INTO num_buckets BUCKETS]
* [SKEWED BY (col1, col2, ...) ON ((col_value, col_value, ...), ...)
Contributor

@andrewor14 I just did a quick check with our InsertIntoHiveTable command. It seems this command does not really know how to handle a table with CLUSTERED BY, SORTED BY or SKEWED BY specifications. How about we just throw an exception when users provide these specs?

@andrewor14 (Contributor Author)

ok
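The agreed-upon guard could be sketched like this. All names here are hypothetical; in the PR the real check would live in the Hive insert path.

```scala
// Sketch: fail fast when inserting into a table whose metadata carries
// CLUSTERED BY / SORTED BY / SKEWED BY specs, since InsertIntoHiveTable
// would otherwise silently ignore them.
case class BucketingSpec(
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String],
    skewColumnNames: Seq[String])

def assertInsertable(spec: BucketingSpec): Unit = {
  if (spec.bucketColumnNames.nonEmpty ||
      spec.sortColumnNames.nonEmpty ||
      spec.skewColumnNames.nonEmpty) {
    throw new UnsupportedOperationException(
      "Inserting into tables with CLUSTERED BY, SORTED BY or SKEWED BY is not supported")
  }
}
```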

@andrewor14 andrewor14 changed the title [SPARK-14388][SQL][WIP] Implement CREATE TABLE [SPARK-14388][SQL] Implement CREATE TABLE Apr 11, 2016
@SparkQA

SparkQA commented Apr 12, 2016

Test build #55552 has finished for PR 12271 at commit 2e95ecf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor Author)

I'll have a look at the failing tests in a couple of hours.

* [COMMENT table_comment]
* [PARTITIONED BY (col3 data_type [COMMENT col_comment], ...)]
* [CLUSTERED BY (col1, ...) [SORTED BY (col1 [ASC|DESC], ...)] INTO num_buckets BUCKETS]
* [SKEWED BY (col1, col2, ...) ON ((col_value, col_value, ...), ...) [STORED AS DIRECTORIES]]
Contributor

Remove this line while resolving the conflicts?

Andrew Or added 2 commits April 12, 2016 16:42
Previously we always converted the data type string to lower case.
However, for struct fields this also converts the struct field
names to lower case, which is not what tests (or perhaps user code)
expect.
@SparkQA

SparkQA commented Apr 13, 2016

Test build #55658 has finished for PR 12271 at commit 50a2054.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 13, 2016

Test build #55666 has finished for PR 12271 at commit 8dc554a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor Author)

I've ignored some tests in HiveCompatibilitySuite for now. I'll have a look at them shortly. I just wanted to see if the rest of the tests will pass.

@SparkQA

SparkQA commented Apr 13, 2016

Test build #55674 has finished for PR 12271 at commit 8e273fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"date_join1",
"date_serde",
"decimal_1",
//"decimal_1", // TODO: cannot parse column decimal(5)
Contributor

Yeah, we should support decimal(5). Here, the scale is 0.
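For illustration, here is a toy parser (not Spark's CatalystSqlParser) showing the defaulting rule under discussion: decimal(5) means precision 5 with scale defaulting to 0. The default precision of 10 for a bare `decimal` is an assumption based on Spark SQL's DecimalType default.

```scala
// Toy illustration of decimal type parsing with a defaulted scale.
case class DecimalType(precision: Int, scale: Int)

val DecimalPattern = """decimal(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?""".r

def parseDecimal(typeString: String): DecimalType =
  typeString.trim.toLowerCase match {
    case DecimalPattern(p, s) =>
      // decimal        -> (10, 0)  (default precision assumed here)
      // decimal(5)     -> (5, 0)   scale defaults to 0
      // decimal(5, 2)  -> (5, 2)
      DecimalType(Option(p).fold(10)(_.toInt), Option(s).fold(0)(_.toInt))
    case other =>
      throw new IllegalArgumentException(s"cannot parse: $other")
  }
```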

Contributor

// just convert the whole type string to lower case, otherwise the struct field names
// will no longer be case sensitive. Instead, we rely on our parser to get the proper
// case before passing it to Hive.
HiveMetastoreTypes.toDataType(col.dataType.getText).simpleString,
Contributor

I think we need to use parseDataType provided in AbstractSqlParser.
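To see why blanket lowercasing of the type string breaks things (as the inline comment warns), a tiny self-contained illustration:

```scala
// Lowercasing the entire type string also lowercases struct field names,
// so "MyField" can no longer be recovered. Only type keywords should be
// case-insensitive; field names must be preserved as written, which is
// why a proper parser is needed instead of toLowerCase.
def naiveNormalize(typeString: String): String = typeString.toLowerCase

val normalized = naiveNormalize("struct<MyField:int,OtherField:string>")
// The field names have been irreversibly folded to lower case.
```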

Andrew Or added 3 commits April 12, 2016 23:26
There were a few differences in DESCRIBE TABLE:
- output format should be HiveIgnoreKeyTextOutputFormat
- num buckets should be -1
- last access time should be -1
- EXTERNAL should not be set to false for managed table

After making these changes, our result now matches Hive's.
CatalystSqlParser knows how to parse decimal(5)!
// just convert the whole type string to lower case, otherwise the struct field names
// will no longer be case sensitive. Instead, we rely on our parser to get the proper
// case before passing it to Hive.
CatalystSqlParser.parseDataType(col.dataType.getText).simpleString,
Contributor

MINOR/NIT: The DataType parsing is done in the AstBuilder, so we really don't need to parse this string again. You could use (magical/evil) typedVisit[DataType](col.dataType) here.

@SparkQA

SparkQA commented Apr 13, 2016

Test build #55692 has finished for PR 12271 at commit 59edce3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 13, 2016

Test build #55693 has finished for PR 12271 at commit a60e66a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 13, 2016

Test build #55696 has finished for PR 12271 at commit 02738fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

table.storage.outputFormat.map(toOutputFormat).foreach(hiveTable.setOutputFormatClass)
table.storage.serde.foreach(hiveTable.setSerializationLib)
hiveTable.setSerializationLib(
table.storage.serde.getOrElse("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))
Contributor

hmm... Without this PR, the storage related fields of the table defined in org.apache.spark.sql.hive.MetastoreDataSourcesSuite.persistent JSON table are:

[SerDe Library:         org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe    ]
[InputFormat:           org.apache.hadoop.mapred.SequenceFileInputFormat     ]
[OutputFormat:          org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat    ]

Looks like the change here somehow breaks the test (without it, we trigger a weird code path in Hive and get at least one column called col).

Contributor

For now, if the schema does not have any field, can we set it to the following one?

[# col_name             data_type               comment             ]
[        ]
[col                    array<string>           from deserializer   ]

So, we try to preserve the existing (and weird) behavior.
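A sketch of the fallback being proposed (helper name hypothetical; the real change is in the referenced PR):

```scala
// Sketch: when a data source table stores no schema fields in the
// metastore, fall back to Hive's legacy placeholder column so DESCRIBE
// output keeps the existing (weird) behavior.
case class Column(name: String, dataType: String, comment: String)

def schemaOrPlaceholder(schema: Seq[Column]): Seq[Column] =
  if (schema.isEmpty) Seq(Column("col", "array<string>", "from deserializer"))
  else schema
```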

Contributor

I submitted PR to try it out (#12363). You can find my change in the last commit (ab70cb7).

@andrewor14 (Contributor Author)

why are there all these undocumented implicit behaviors in Hive :(

Member

Now, after a few bad experiences, I think we need to add test cases to ensure the Hive APIs work as we expect. See another issue I hit in alter table ... drop partition: #12220 (diff)

Contributor

I feel it is better to have more checks in SessionCatalog to make sure that a request is valid (it can help your case).

Member

Yeah, agree if you are talking about my case. : )

@andrewor14 (Contributor Author)

OK, I cherry-picked your changes and updated the comment.

@SparkQA

SparkQA commented Apr 13, 2016

Test build #55727 has finished for PR 12271 at commit 55957bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor)

yhuai commented Apr 13, 2016

OK. Merging to master.

@asfgit asfgit closed this in 7d2ed8c Apr 13, 2016
@andrewor14 andrewor14 deleted the create-table-ddl branch June 22, 2016 17:48